JESC: Japanese-English Subtitle Corpus
نویسندگان
چکیده
In this paper we describe the Japanese-English Subtitle Corpus (JESC). JESC is a large Japanese-English parallel corpus covering the underrepresented domain of conversational dialogue. It consists of more than 3.2 million examples, making it the largest freely available dataset of its kind. The corpus was assembled by crawling and aligning subtitles found on the web. The assembly process incorporates a number of novel preprocessing elements to ensure high monolingual fluency and accurate bilingual alignments. We summarize its contents and evaluate its quality using human experts and baseline machine translation (MT) systems.
منابع مشابه
The Importance of Downtoners in English Writing and Translation -with Reference to Chinese Movie Subtitle Translations
This paper reports on a study of the use of downtoners, an important hedging device, by Chinese translators when translating Chinese subtitles into English. The study was carried out by making a corpus-based analysis of patterns of using downtoners in a Chinese subtitle corpus and an English subtitle corpus. Features of overuse and underuse of downtoners in subtitle translation by the Chinese t...
متن کاملStrategies Used in the Translation of Interlingual Subtitling
This study was an attempt to identify the interlingual strategies employed to translate English subtitles into Persian and to determine their frequency, as well. Contrary to many countries, subtitling is a new field in Iran. The study, a corpus-based, comparative, descriptive, non-judgmental analysis of an English-Persian parallel corpus, comprised English audio scripts of five movies of differ...
متن کاملTranslating DVD subtitles from English-German and English-Japanese using Example-Based Machine Translation
Due to limited budgets and an ever-diminishing time-frame for the production of subtitles for movies released in cinema and DVD, there is a compelling case for a technology-based translation solution for subtitles (O’Hagan, 2003; Carroll, 2004; Gambier, 2005). In this paper we describe how an Example-Based Machine Translation (EBMT) approach to the translation of English DVD subtitles into Germ...
متن کاملA Japanese-English Patent Parallel Corpus
We describe a Japanese-English patent parallel corpus created from the Japanese and US patent data provided for the NTCIR-6 patent retrieval task. The corpus contains about 2 million sentence pairs that were aligned automatically. This is the largest Japanese-English parallel corpus, which will be available to the public after the 7th NTCIR workshop meeting. We estimated that about 97% of the s...
متن کاملAbsence of Developmental Change in Epenthetic Vowel Duration in Japanese Speakers’ English
This study examines developmental change in the production of epenthetic vowels by Japanese learners of English in relation to acquisition of L2 English speech rhythm. Seventy-two Japanese learners of English in the J-AESOP corpus were divided into lowerand higher-level learners according to their proficiency score and the frequency of vowel epenthesis. Three learners were excluded because no v...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- CoRR
دوره abs/1710.10639 شماره
صفحات -
تاریخ انتشار 2017